Abstract: In this work novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth-reduction techniques. This technique shows in the best case an improvement of more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require higher area than binary ones. This characteristic has the negative effect of increasing the data width of the network nodes. Thus, the second contribution of this work is to reduce the network complexity needed to support double-binary codes, by exploiting bit-level and pseudo-floating-point representation of the extrinsic information. These two techniques allow for an area reduction of more than 40% in the best case, with a performance degradation of about 0.2 dB.

Index Terms: Turbo Decoder, Network on Chip, VLSI



I. Introduction 

Today, modern telecommunications are a pervasive experience of data exchange among users and devices. One critical aspect of this scenario is the continuous demand for higher data rates, a problem that is exacerbated by the need for reliable transmission of data. To that purpose, the push toward the so-called beyond-3G technologies, such as WiMAX [1] and 3GPP-LTE [2], is a possible answer, where reliability is obtained by exploiting effective error correcting codes, such as turbo [3] and LDPC [4] codes. Unfortunately, the decoding algorithms for these codes are iterative, making high-throughput implementations a challenging task [5], [6].

As shown in Table I in [7], several modern communication standards use turbo codes as a reliable channel coding scheme. However, since these codes have limited similarities, flexible architectures able to support different standards are interesting solutions to achieve interoperability. This direction has been investigated in several works [9]-[15], where not only flexibility but also high throughput, achieved by means of parallel architectures, is addressed. As an example, [11], [12], [15] deal with optimized ASIC architectures where the flexibility is limited to two standards: UMTS/WiMAX, 3GPP-LTE/WiMAX and 3GPP-LTE/HSDPA, respectively. On the other hand, [9], [10], [13], [14] are based on the ASIP approach, where optimized processor-like architectures are used. It is worth observing that ASIP-based solutions allow for greater flexibility than ASIC-based architectures, as they can support several different codes and standards. Moreover, as suggested in [13], ASIP solutions are well suited to implement high-throughput multiprocessor turbo decoder architectures.

Recently, in [16] we introduced the concept of intra-IP Network-on-Chip (NoC), where the well-known NoC paradigm is applied to the communication structure of processing elements that belong to the same IP. As discussed in several works, such as [7], [17]-[20], intra-IP NoC is a flexible solution to enable multi-ASIP turbo decoder architectures. However, as shown in detail in [7], [20], flexibility comes at the expense of increasing the complexity of the decoder architecture. In this work we improve the complexity/performance trade-off of NoC-based turbo decoder architectures by reducing the traffic load on the network, as suggested in [21]. The adopted traffic reduction technique offers, in the best case, a throughput improvement of more than 60 Mb/s and 40 Mb/s for binary and double-binary codes, respectively. Furthermore, we exploit two known techniques [22], [23], originally proposed to limit the amount of memory in turbo decoder architectures, as possible solutions to reduce the complexity of the NoC when double-binary turbo codes [24] are employed, as in the WiMAX standard.

Figure 1. Parallel concatenation of two convolutional codes: encoder (a), decoder (b), notation for a trellis section (c)

The paper is structured as follows: in Section II we recall the equations required to implement the decoding algorithm, whereas in Section III we describe the peculiar characteristics of an NoC-based turbo decoder architecture, including the architecture of routing elements, low-complexity routing algorithms and topologies. Section IV describes the experimental setup we defined to increase the throughput and reduce the area of NoC-based turbo decoder architectures for both binary and double-binary codes. To this purpose we considered the HSDPA and the 3GPP-LTE standards for the case of binary codes, and the WiMAX standard for the case of double-binary codes. Finally, in Section V conclusions are drawn.

Figure 2. Node block scheme: (a) FA architecture, (b) AP architecture, (c) PP architecture

II. Decoding algorithms 

Since turbo codes are based on the concatenation (usually parallel) of two constituent Convolutional Codes (CC) (Fig. 1(a)), the decoder is made of two constituent decoders that exchange their data by means of an interleaver ($\Pi$) and a deinterleaver ($\Pi^{-1}$), see Fig. 1(b). For the sake of brevity, in the next paragraph we define the symbols used in Fig. 1(a) and (b) without specifying whether they are related to CC1 or CC2.

The decoding algorithm of turbo codes is an iterative process made of two half iterations, one for each constituent decoder, where each half iteration is based on Maximum-A-Posteriori (MAP) estimation achieved by means of the BCJR algorithm [25], where the Log-Likelihood-Ratio (LLR) representation is usually adopted [26]. Based on the trellis notation shown in Fig. 1(c) and denoting by $\mathcal{U}$ the set of uncoded symbols, each constituent MAP decoder, often referred to as Soft-In-Soft-Out (SISO) module, computes

$$\lambda_k^{ext}[u] = \max^*_{e:u(e)=u}\{b^{ext}(e)\} - \max^*_{e:u(e)=\tilde{u}}\{b^{ext}(e)\} - \lambda_k^{apr}[u] \tag{1}$$

where $\tilde{u} \in \mathcal{U}$ is an uncoded symbol taken as a reference (usually $\tilde{u} = 0$), $u \in \mathcal{U} \setminus \{\tilde{u}\}$, $k$ is a trellis step, $e$ is a transition in a trellis step and $u(e)$ is the corresponding uncoded symbol. Thus, $\lambda_k^{ext}[u]$ and $\lambda_k^{apr}[u]$ are the extrinsic and a-priori information, respectively, for symbol $u$ at trellis step $k$, expressed as LLRs. The $\max^*\{x_i\}$ function is implemented as $\max\{x_i\}$ followed by a correction term often stored in a small Look-Up-Table (LUT) [27], [28]. The correction term, usually adopted when decoding binary codes (Log-MAP), can be omitted for double-binary turbo codes with minor error rate performance degradation (Max-Log-MAP).
The term $b^{ext}(e)$ in (1) is defined as:

$$b^{ext}(e) = \alpha_{k-1}[s^S(e)] + \gamma_k^{ext}[e] + \beta_k[s^E(e)] \tag{2}$$

$$\alpha_k[s] = \max^*_{e:s^E(e)=s}\{\alpha_{k-1}[s^S(e)] + \gamma_k[e]\} \tag{3}$$

$$\beta_k[s] = \max^*_{e:s^S(e)=s}\{\beta_{k+1}[s^E(e)] + \gamma_k[e]\} \tag{4}$$

$$\gamma_k[e] = \lambda_k[u(e)] + \lambda_k[c^u(e)] + \lambda_k[c^p(e)] \tag{5}$$

$$\gamma_k^{ext}[e] = \lambda_k[c^p(e)] \tag{6}$$

where $s^S(e)$ and $s^E(e)$ are the starting and the ending states of $e$, and $\alpha_k[s^S(e)]$ and $\beta_k[s^E(e)]$ are the forward and backward metrics associated with $s^S(e)$ and $s^E(e)$, respectively. The terms $\lambda_k[u(e)]$, $\lambda_k[c^u(e)]$ and $\lambda_k[c^p(e)]$ are obtained by adding the corresponding a-priori, intrinsic systematic and intrinsic parity LLRs, respectively.
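
The $\max^*$ operator used in the recursions above can be illustrated with a minimal Python sketch. It assumes the standard Jacobian-logarithm form, $\max^*(a, b) = \max(a, b) + \ln(1 + e^{-|a-b|})$; the function names, the LUT step and the LUT size are illustrative choices and are not taken from the decoder described in this paper.

```python
import math

def max_star(a: float, b: float) -> float:
    """Log-MAP: max plus the exact correction term ln(1 + exp(-|a - b|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a: float, b: float) -> float:
    """Max-Log-MAP: the correction term is dropped."""
    return max(a, b)

# Hypothetical 8-entry LUT sampling the correction term every 0.5 (assumed step).
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(a: float, b: float) -> float:
    """max followed by a table look-up; differences beyond the table get no correction."""
    index = int(abs(a - b) / LUT_STEP)
    return max(a, b) + (LUT[index] if index < len(LUT) else 0.0)
```

The gap between `max_star` and `max_log` is largest when the two arguments are equal (it is then $\ln 2$) and vanishes as they move apart, which is why a small table is usually sufficient.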

In a parallel decoder, $P$ SISOs operate concurrently on disjoint portions of the trellis. Denoting by $N$ the number of trellis steps processed by each constituent decoder, each SISO operates on a trellis slice made of $N/P$ steps. As a consequence, we can extend the notation introduced in the previous paragraph to a parallel decoder, where $\lambda_{i,j}^{ext}[u]$ is the extrinsic information produced by SISO $i$ at the $j$-th trellis step. For further details on the decoding algorithm the reader can refer to [5].

Figure 3. BER performance of the HSDPA N = 5114 turbo decoder with ABR technique for different values of K

III. NoC-based turbo decoder architectures

An NoC-based turbo decoder architecture can be represented as a graph where each node is made of a Routing Element (RE) and a Processing Element (PE) (see Fig. 2). Each PE, devoted to performing the processing required by the BCJR algorithm, contains a SISO processor and two memories where intrinsic and a-priori information are stored, respectively. On the other hand, each RE has a simple structure made of $M$ input buffers (FIFOs), an $M \times M$ crossbar switch and $M$ output registers. REs are devoted to routing the data produced by PEs to the correct destination node according to $\Pi$ and $\Pi^{-1}$. To this purpose we introduce $d(i,j)$ as the destination node of $\lambda_{i,j}^{ext}[u]$. In order to complete a half iteration, $\lambda_{i,j}^{ext}[u]$ is stored at the location $t(i,j)$ in the a-priori information memory of node $d(i,j)$.

In general PEs and REs can operate at different rates; thus, to decouple the design of PEs and REs we define $R$ as the number of packets injected into the network in a clock cycle. As a consequence, $R = 1$ means that each PE injects one new packet into the network per clock cycle, whereas $R = 0.5$ means that a new packet is injected into the network every two clock cycles. It is worth noting that the case $R = 1$ corresponds to REs and PEs working at the same clock frequency (isochronous), with PEs able to output a new packet of extrinsic information at each clock cycle. On the contrary, $R < 1$ models either an isochronous system where PEs output fewer than one packet per clock cycle or a mesochronous system where REs work at a higher clock frequency than PEs.

A. RE architectures 

In [20] three possible architectures for REs (see Fig. 2), referred to as Fully-Adaptive (FA), All-Precalculated (AP) and Partially-Precalculated (PP) architectures, were presented.



The FA architecture (Fig. 2(a)) sends on the network packets of data made of a header, containing $d(i,j)$, and a payload, containing $\lambda_{i,j}^{ext}[u]$ and $t(i,j)$. The data are routed by means of a Routing Algorithm (RA).

The AP architecture (Fig. 2(b)) is obtained by observing that, given $\Pi$ and $\Pi^{-1}$, the destination $d(i,j)$ and the location $t(i,j)$ of every packet are known in advance; in the corresponding expressions $\lfloor \cdot \rfloor$ is the next lowest integer value and the permutation can be either $\Pi(\cdot)$ or $\Pi^{-1}(\cdot)$ depending on the current half iteration. As a consequence, for each node we can precalculate and store in a Routing Memory (RM) and in a Location Memory the routing information and $t(i,j)$, the location where the received value $\lambda_{i,j}^{ext}[u]$ will be stored, respectively. Thus, with the AP architecture we reduce the width of the data bus at the expense of some extra memory.
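
To make the precalculation concrete, the following Python sketch derives the destination and location sequences from a permutation. It is only an illustration under stated assumptions: it assumes that the value produced by SISO $i$ at local step $j$ has global index $i \cdot N/P + j$, that its permuted index is split into a destination node (integer quotient by $N/P$) and a memory location (remainder), and that $P$ divides $N$. The function name `precompute_routing` and the toy permutation are hypothetical.

```python
def precompute_routing(perm, num_siso):
    """Return (dest, loc): dest[i][j] is the destination node and loc[i][j] the
    memory location of the value produced by SISO i at local trellis step j.

    perm is the interleaver (or deinterleaver) given as a list of N indices;
    num_siso (P) is assumed to divide N.
    """
    slice_len = len(perm) // num_siso  # N / P trellis steps per SISO
    dest = [[0] * slice_len for _ in range(num_siso)]
    loc = [[0] * slice_len for _ in range(num_siso)]
    for i in range(num_siso):
        for j in range(slice_len):
            target = perm[i * slice_len + j]   # permuted global index
            dest[i][j] = target // slice_len   # floor: destination node
            loc[i][j] = target % slice_len     # position inside that node's memory
    return dest, loc

# Toy example: N = 16, P = 4, with a reversal used as a stand-in permutation.
toy_perm = list(range(15, -1, -1))
toy_dest, toy_loc = precompute_routing(toy_perm, 4)
```

With this mapping, storing `dest` corresponds to the Routing Memory content and storing `loc` to the Location Memory content, one table per half iteration.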

The PP architecture (Fig. 2(c)) only precalculates the $t(i,j)$ sequences; thus, it requires a narrower data width than the FA architecture and less memory than the AP one.

To improve the throughput/area figures of NoC-based turbo decoder architectures we infer from [20] two main results:
• The AP architecture can be conveniently used with complex routing algorithms to concurrently maximize the throughput and minimize the area. Unfortunately, as pointed out in [7], this comes at the expense of a significant amount of external memory to store the routing information; as an example, to support all the interleavers specified by the HSDPA standard [29], about 64 MB of memory are required.
• As long as the network is faster than the PEs ($R < 1$), throughput and area figures tend to be independent of the routing algorithm.
Thus, both FA and PP architectures with simple RAs should be further investigated. In particular, the performance of the FA architecture can be improved by using Adaptive Bandwidth Reduction (ABR) techniques such as the one proposed in [21], namely by avoiding the exchange of unnecessary extrinsic information values. This distinguishing feature of the FA architecture, which is not available with the AP and PP architectures, is detailed in Section IV-A. On the contrary, the PP architecture features a narrower data bus than the FA one; however, it requires some external memory to store the configurations of all the Location Memories. Moreover, in several standards, such as HSDPA, 3GPP-LTE and WiMAX, the $d(i,j)$ and $t(i,j)$ sequences can be generated algorithmically with simple architectures [12], [30], [31]. As a consequence, the FA architecture can also take advantage of this feature to reduce the complexity of the whole decoder.

B. Low complexity RAs 

In order to increase the throughput and reduce the area of the decoder, RAs should be based on simple, deadlock-free routing policies that can be implemented with little logic and completed in one clock cycle. As suggested in [20], Round-Robin (RR) and FIFO-length (FL) are suitable policies for NoC-based turbo decoders. RR is based on a circular serving policy, whereas with FL each input is served according to the number of elements stored in its input buffer: FL sorts the input buffers by the number of stored elements and then serves them in decreasing order. Routing paths are stored in a routing table; for each pair of nodes in the network, one shortest path is stored in the routing table. This approach, where only one shortest path is considered, will be referred to as Single-Shortest-Path (SSP) [20] in the rest of the paper.

Figure 4. BER performance of the 3GPP-LTE N = 6144 turbo decoder with ABR technique, K = 4, 6, 8, 10

Figure 5. Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 4, 6, 8, 10 on generalized Kautz networks D = 2 and P = 64
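
The RR and FL serving policies described in this subsection can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the tie-breaking rule for FL (lower input index first) is an assumption, since the text does not specify one.

```python
def round_robin_order(num_inputs, last_served):
    """Circular serving order: start from the input after the last one served."""
    return [(last_served + 1 + k) % num_inputs for k in range(num_inputs)]

def fifo_length_order(fifo_lengths):
    """Serve inputs by decreasing buffer occupancy.

    sorted() is stable, so inputs with equal occupancy keep their index order
    (an assumed tie-breaking rule).
    """
    return sorted(range(len(fifo_lengths)), key=lambda i: -fifo_lengths[i])

rr_order = round_robin_order(3, last_served=0)
fl_order = fifo_length_order([2, 5, 5, 0])
```

RR needs only a pointer to the last served input, whereas FL needs the occupancy of every input FIFO and a small sorting network, which is the extra logic the two policies trade against throughput.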

C. NoC topologies 

In [20] several fixed-degree topologies for NoC-based turbo decoder architectures are considered. However, since $\Pi$ and $\Pi^{-1}$ tend to spread $\lambda_{i,j}^{ext}[u]$ almost uniformly, the traffic pattern on the network is almost uniform too. Experimental results in [20] show that topologies with logarithmic diameter, such as generalized De Bruijn [32] and generalized Kautz [33], achieve higher throughput and require lower area than other well-known fixed-degree topologies such as ring, honeycomb and toroidal-mesh ones.
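
As an illustration of these two topologies, the Python sketch below builds both digraphs and counts their self-loops. It assumes the usual constructions: node $i$ of a generalized Kautz (Imase-Itoh) graph with $P$ nodes and degree $D$ points to $(-D \cdot i - a) \bmod P$ for $a = 1, \dots, D$, while node $i$ of a generalized De Bruijn graph points to $(D \cdot i + a) \bmod P$ for $a = 0, \dots, D-1$. Whether [32] and [33] use exactly these definitions is an assumption, and the function names are hypothetical.

```python
def generalized_kautz(num_nodes, degree):
    """Assumed Imase-Itoh construction: i -> (-degree * i - a) mod num_nodes."""
    return {i: [(-degree * i - a) % num_nodes for a in range(1, degree + 1)]
            for i in range(num_nodes)}

def generalized_de_bruijn(num_nodes, degree):
    """Assumed generalized De Bruijn construction: i -> (degree * i + a) mod num_nodes."""
    return {i: [(degree * i + a) % num_nodes for a in range(degree)]
            for i in range(num_nodes)}

def count_self_loops(graph):
    """Number of edges whose source and destination nodes coincide."""
    return sum(successors.count(node) for node, successors in graph.items())

kautz_loops = count_self_loops(generalized_kautz(64, 3))
de_bruijn_loops = count_self_loops(generalized_de_bruijn(64, 3))
```

A self-loop occupies one port of a node without connecting it to any other node, so a topology with fewer self-loops leaves more ports available for useful traffic.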

IV. Experimental setup 

Since in this work we aim at increasing the throughput and reducing the area of NoC-based turbo decoder architectures, we focus on the most significant cases discussed in Section III, namely the FA node architecture with SSP-RR and SSP-FL routing algorithms. Moreover, we consider only generalized Kautz topologies, as they have logarithmic diameter and fewer self-loops^1 than generalized De Bruijn ones [20], [32], [33]. The degree of the network $D = M - 1$ ranges in $\{2, 3, 4\}$ and the parameter $R$ varies in $\{0.33, 0.5, 1\}$. We simulated both HSDPA and 3GPP-LTE interleavers for the case of binary turbo codes. Furthermore, we simulated the double-binary turbo code used in the WiMAX standard as well.

^1 If we model a topology as a graph, a self-loop is an edge whose source and destination nodes coincide.

Figure 6. Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 4, 6, 8, 10 on generalized Kautz networks D = 3 and P = 64
In the following the throughput is computed as

$$T = \frac{N_b \cdot f_{clk}}{I \cdot \left(N_{cyc}^{int} + N_{cyc}^{deint}\right)} \tag{9}$$

where $N_b$ is the number of decoded bits, $f_{clk}$ is the clock frequency, $I$ is the number of iterations, and $N_{cyc}^{int}$ and $N_{cyc}^{deint}$ are the numbers of clock cycles required to complete the interleaved and deinterleaved half iterations, respectively. It is worth pointing out that $N_b = N$ for binary codes and $N_b = 2N$ for double-binary codes. The results shown in the following sections have been obtained for $f_{clk} = 200$ MHz and $I = 8$ with the Turbo-NoC simulator [34] and Synopsys Design Compiler for a 130 nm standard cell technology.
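
A short Python sketch shows how the throughput figure follows from these quantities. It assumes that the throughput is the number of decoded bits multiplied by the clock frequency and divided by the total number of clock cycles spent over all iterations; the cycle counts used in the example are hypothetical placeholders, not simulation results from this paper.

```python
def throughput_mbps(decoded_bits, f_clk_hz, iterations, cycles_int, cycles_deint):
    """Throughput in Mb/s: decoded bits per second of decoding time.

    cycles_int and cycles_deint are the clock cycles of the interleaved and
    deinterleaved half iterations, respectively.
    """
    total_cycles = iterations * (cycles_int + cycles_deint)
    return decoded_bits * f_clk_hz / total_cycles / 1e6

# Binary code with N = 5114, f_clk = 200 MHz and I = 8 as in the text;
# 320 cycles per half iteration is an assumed value for illustration only.
example_binary = throughput_mbps(5114, 200e6, 8, 320, 320)

# Double-binary code: two bits are decoded per trellis step (N_b = 2N).
example_double_binary = throughput_mbps(2 * 1920, 200e6, 8, 320, 320)
```

Since the number of iterations and the clock frequency are fixed, any technique that shortens a half iteration, such as removing packets from the network, translates directly into a throughput gain.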
Figure 7. Average throughput improvement at different SNR values of the HSDPA N = 5114 turbo decoder with K = 8, 10 on generalized Kautz networks D = 4 and P = 64



A. ABR in NoC-based turbo decoder architectures 

According to [21], the throughput of an NoC-based turbo decoder can be increased by reducing the amount of data injected into the network. This approach is similar to the well-known early stopping criteria that are routinely used to both increase the throughput and reduce the power consumption in turbo decoder architectures [35]. However, most related works focus on frame-level early stopping criteria. On the contrary, bit-level/symbol-level early stopping criteria [36] take into account that the reliability of each bit/symbol in a frame converges at a different speed. As a consequence, when the extrinsic information of a certain bit/symbol meets a proper reliability criterion, it is not necessary to refine it further. From an NoC-based turbo decoder perspective, this means that reliable $\lambda_{i,j}^{ext}[u]$ values are no longer sent over the network.

B. HSDPA and 3GPP-LTE case study

For binary turbo codes, such as the ones employed in the HSDPA and 3GPP-LTE standards, a simple ABR technique is obtained by fixing a threshold $K$ that is compared with $\delta = |\lambda_{i,j}^{ext}[u] - \lambda_{i,j}^{apr}[u]|$: if $\delta < K$, then $\lambda_{i,j}^{ext}[u]$ is not sent. The choice of $K$ depends not only on the specific code considered but also on the quantization parameters used to represent $\lambda_{i,j}^{ext}[u]$ and on the performance loss in terms of Bit-Error-Rate (BER) that can be accepted. In the following we consider $N = 5114$ for HSDPA and $N = 6144$ for 3GPP-LTE. In both cases the extrinsic information is represented on eight bits, whereas the intrinsic information is represented on six bits with three fractional bits. Both decoders perform eight iterations ($I = 8$) with $P = 64$ using the Log-MAP algorithm [27] with a LUT-stored correction term. In Fig. 3 and Fig. 4 we show the BER performance for the HSDPA and 3GPP-LTE codes, respectively, obtained by applying the ABR technique described above with several values of $K$^2. In particular, in Fig. 3 we show for the HSDPA code that when $K > 10$ the performance worsens significantly. As an example, with $K = 10$ there is a performance loss of less than 0.1 dB in the waterfall region and nearly ideal performance when the code floors. On the other hand, with $K = 16$ the performance loss is about 0.2 dB in the waterfall region and the code floor is shifted toward higher SNR values by about 0.2 dB as well. Similar results were observed for the 3GPP-LTE code, so, for the sake of clarity, in Fig. 4 only the results obtained with $K = 4, 6, 8, 10$ are shown. For both cases we obtained the corresponding best and average bandwidth reduction at different SNR values through Monte Carlo simulation^3. Experimental results show that the throughput increase is significant when there is a high load on the network ($R = 1$), with either the FL or the RR routing algorithm. In particular, in Fig. 5, Fig. 6 and Fig. 7 we show the average throughput increase for the HSDPA turbo decoder for $D = 2, 3, 4$, respectively, with different values of $K$. As can be observed, when $R = 1$ there is an average throughput increase, with respect to a decoder where ABR is not applied, that ranges from about 5 to 20 Mb/s for the HSDPA turbo decoder. Furthermore, we observed that in the best case there is a throughput increase of at least 60 Mb/s. On the other hand, when $R < 1$ the average throughput improvement is at most 5 Mb/s. Similar results have been obtained for the LTE turbo decoder.

^2 Since we use three fractional bits for data representation, the integer values of K we considered correspond to 0.25, 0.75, 1 and so on.
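
The threshold test used for binary codes can be sketched as follows in Python. This is a minimal illustration, assuming integer (fixed-point) LLR values and the comparison between extrinsic and a-priori information described above; the function name `abr_filter` and the sample values are hypothetical.

```python
def abr_filter(ext_llrs, apr_llrs, threshold):
    """Return the trellis-step indices whose extrinsic value is still sent.

    A value is suppressed when |ext - apr| < threshold, i.e. when the new
    extrinsic information differs little from the a-priori information.
    """
    return [j for j, (ext, apr) in enumerate(zip(ext_llrs, apr_llrs))
            if abs(ext - apr) >= threshold]

# Hypothetical fixed-point LLRs for four trellis steps, with K = 8.
sent = abr_filter([40, -3, 12, -50], [38, 5, 12, -20], threshold=8)
```

In this example only two of the four values are injected into the network; the fraction of suppressed packets is what determines the bandwidth reduction and hence the throughput gain when the network is heavily loaded.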

To complete the comparison, we show in Table I the throughput/area results for the HSDPA and LTE cases, where the results for the HSDPA case with the ASP-FT routing algorithm and AP node architecture are taken from [7]. As can be observed, the significant throughput increase obtained with the ABR technique on the FA node architecture when $R = 1$ is paid for with an area overhead with respect to the AP node architecture. However, as pointed out in [7], the AP node architecture requires a large external memory to store the routing information. Moreover, the difference in area between FA and AP node architectures shrinks when $R < 1$. In particular, as shown in Table I, when $R = 0.33$ with $P = 8$ and $P = 16$ the FA node architecture with the SSP-FL routing algorithm requires less area than the AP one.

C. WiMAX case study

Simulation results shown in this section have been obtained with $N = 1920$, as in [23]. Each component of the extrinsic information is represented on eight bits, whereas the intrinsic information is represented on six bits with two fractional bits. The decoder performs eight iterations ($I = 8$) with $P = 64$ using the Max-Log-MAP algorithm [27].

^3 The worst case corresponds to simulations where ABR is not applied.

Table I
Throughput [Mb/s] - area [mm^2] achieved with the HSDPA N = 5114 and LTE N = 6144 interleavers, with generalized Kautz topologies for P ∈ {8, 16, 32, 64}, R ∈ {0.33, 0.5, 1}, SSP-RR, SSP-FL and ASP-FT routing algorithms. No ABR
199 - 3.89
246 - 1.99
372 - 3.31
76 - 1.45
86 - 0.78
151 - 1.44
151 - 1.12
312 - 5.68
SSP-RR (FA)

Figure 8. BER performance of the WiMAX N = 1920 turbo decoder with SL, BL, PFP representation and ABR technique, K = 4, 6

Figure 9. Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 2 and P = 64

Since in binary turbo codes $\mathcal{U} = \{0, 1\}$, the LLR of the extrinsic information is a scalar value. On the other hand, for double-binary turbo codes $\mathcal{U} = \{00, 01, 10, 11\}$; as a consequence, $\lambda_{i,j}^{ext}[u]$ is an array containing three elements. In [23] a bit-level double-binary turbo decoder architecture is proposed to reduce the amount of memory needed to store the extrinsic information. The same idea is exploited in this work to reduce the area overhead of the NoC. Basically, a double-binary uncoded symbol $u$ can be represented as a pair of binary random variables $AB$. Then, with a slight abuse of notation, given a binary random variable $X$, we denote $X = 0$ with $\bar{X}$ and $X = 1$ with $X$. Resorting to the Max-Log-MAP approximation we can convert Symbol-Level (SL) LLRs to Bit-Level (BL) LLRs as

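$$\lambda_{i,j}^{ext}[A] \approx \mu_A - \mu_{\bar{A}} \tag{10}$$

$$\lambda_{i,j}^{ext}[B] \approx \mu_B - \mu_{\bar{B}} \tag{11}$$

where

$$\mu_A = \max\{\lambda_{i,j}^{ext}[A\bar{B}], \lambda_{i,j}^{ext}[AB]\} \tag{12}$$

$$\mu_{\bar{A}} = \max\{0, \lambda_{i,j}^{ext}[\bar{A}B]\} \tag{13}$$

$$\mu_B = \max\{\lambda_{i,j}^{ext}[\bar{A}B], \lambda_{i,j}^{ext}[AB]\} \tag{14}$$

$$\mu_{\bar{B}} = \max\{0, \lambda_{i,j}^{ext}[A\bar{B}]\}. \tag{15}$$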
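
The symbol-to-bit conversion can be written as a small Python function. The sketch assumes that the symbol 00 is the reference symbol, so that its LLR is zero, and that the three symbol-level LLRs are passed in the order 01, 10, 11 with $A$ as the first bit; the function name `sl_to_bl` and the sample values are hypothetical.

```python
def sl_to_bl(llr_01, llr_10, llr_11):
    """Convert symbol-level LLRs of a double-binary symbol AB to bit-level LLRs.

    The reference symbol 00 has LLR 0. Under the Max-Log-MAP approximation the
    LLR of a bit is the best metric with the bit at 1 minus the best metric
    with the bit at 0.
    """
    mu_a = max(llr_10, llr_11)      # A = 1: symbols 10 and 11
    mu_not_a = max(0, llr_01)       # A = 0: symbols 00 and 01
    mu_b = max(llr_01, llr_11)      # B = 1: symbols 01 and 11
    mu_not_b = max(0, llr_10)       # B = 0: symbols 00 and 10
    return mu_a - mu_not_a, mu_b - mu_not_b

bit_llr_a, bit_llr_b = sl_to_bl(2, -3, 5)
```

Three symbol-level values are thus replaced by two bit-level values, which is where the one-third reduction of the payload width comes from.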

Similarly, we can convert BL LLRs back to SL LLRs with the approximations (16)-(27). The conversion distinguishes four cases, selected by inequalities on $\lambda_{i,j}^{ext}[A]$ and $\lambda_{i,j}^{ext}[B]$, and in each case expresses $\lambda_{i,j}^{ext}[\bar{A}B]$, $\lambda_{i,j}^{ext}[A\bar{B}]$ and $\lambda_{i,j}^{ext}[AB]$ in terms of $\lambda_{i,j}^{ext}[A]$, $\lambda_{i,j}^{ext}[B]$, their sum $\lambda_{i,j}^{ext}[A] + \lambda_{i,j}^{ext}[B]$ and $\mu_{AB}$, where

$$\mu_{AB} = \max\{\lambda_{i,j}^{ext}[A], \lambda_{i,j}^{ext}[B]\} \tag{28}$$

For further details on bit-to-symbol and symbol-to-bit conversion the reader can refer to [23].

The use of BL LLRs introduces a BER performance loss of about 0.2 dB (see Fig. 8), but it reduces the data width by one third with respect to SL LLRs, as the payload of each packet contains $\lambda_{i,j}^{ext}[A]$ and $\lambda_{i,j}^{ext}[B]$ instead of $\lambda_{i,j}^{ext}[u]$. To further reduce the data width we applied to BL LLRs the Pseudo-Floating-Point (PFP) representation suggested in [22]. As highlighted also in [37], [38], the most significant bits of the extrinsic information play an important role in the decoding procedure. Indeed, the basic idea is to analyze the binary representation of $\lambda_{i,j}^{ext}[A]$ and $\lambda_{i,j}^{ext}[B]$ (as 2's complement values) from the most significant bit to the least significant bit and to detect the first zero-one or one-zero transition, which represents the starting bit of the extrinsic information significand. We denote the significand as $\xi$, and the number of bits that prefix $\xi$ is coded as a shift index $\sigma$. Thus, for each pair $\lambda_{i,j}^{ext}[A]$, $\lambda_{i,j}^{ext}[B]$ we obtain $\xi_{i,j}[A]$, $\xi_{i,j}[B]$, $\sigma_{i,j}[A]$ and $\sigma_{i,j}[B]$. Then, according to [22], we impose $\sigma_{i,j} = \min\{\sigma_{i,j}[A], \sigma_{i,j}[B]\}$. Denoting by $n_\lambda$, $n_\xi$ and $n_\sigma$ the number of bits used to represent $\lambda$, $\xi$ and $\sigma$, respectively, we obtain

$$\xi_{i,j}[A] = \lambda_{i,j}^{ext}[A] \gg (n_\lambda - n_\xi - \sigma_{i,j}) \tag{29}$$

$$\xi_{i,j}[B] = \lambda_{i,j}^{ext}[B] \gg (n_\lambda - n_\xi - \sigma_{i,j}) \tag{30}$$

where $\gg$ stands for arithmetic right shift. As a consequence, the payload of each packet sent on the network now contains $\xi_{i,j}[A]$, $\xi_{i,j}[B]$ and $\sigma_{i,j}$ instead of $\lambda_{i,j}^{ext}[u]$.

Figure 10. Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 3 and P = 64


Figure 11. Average throughput improvement at different SNR values of the WiMAX N = 1920 turbo decoder with K = 4, 6 on generalized Kautz networks D = 4 and P = 64 (curves: FL and RR, each with no ABR, K = 4 and K = 6; horizontal axis: SNR [dB])

As stated in the first paragraph of this section, $n_\lambda = 8$. Thus, denoting by $n_d$ the number of bits devoted to representing the extrinsic information in the payload, we have: i) $n_d = 3 n_\lambda = 24$ for $\lambda_{i,j}^{ext}[u]$ and ii) $n_d = 2 n_\lambda = 16$ for $\lambda_{i,j}^{ext}[A]$ and $\lambda_{i,j}^{ext}[B]$. If we impose $n_\xi = 4$, we obtain $\sigma_{i,j} \le 4$ and so $n_\sigma = 3$, leading to $n_d = 2 n_\xi + n_\sigma = 11$, which is less than half the value of $n_d$ for $\lambda_{i,j}^{ext}[u]$. As shown in Fig. 8, the BER performance loss of the BL, PFP LLR representation is nearly the same as that of the fixed-point BL one. In Table II the throughput and area results obtained by using the SL and the BL, PFP LLR representations are shown for generalized Kautz topologies. As can be observed, the area decrease as a function of $n_d$ is not linear; however, it becomes particularly interesting when $R = 1$. As an example, with $R = 1$, $D = 4$ and $P = 64$ there is an area saving of up to 40%.
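
The PFP compression of a pair of bit-level LLRs can be sketched in Python as follows. This is one possible reading of the scheme under stated assumptions: the shift index is taken as the number of redundant sign-extension bits of the 2's complement value, capped at $n_\lambda - n_\xi$, and the decoder restores the magnitude with a left shift. The function names are hypothetical; the bit widths are the ones given in the text.

```python
N_LAMBDA = 8                        # bits of each bit-level LLR (from the text)
N_XI = 4                            # bits of each significand (from the text)
MAX_SHIFT_INDEX = N_LAMBDA - N_XI   # sigma <= 4, so it fits in 3 bits

def shift_index(value):
    """Count the redundant sign-extension bits of an N_LAMBDA-bit value."""
    sign = (value >> (N_LAMBDA - 1)) & 1
    count = 0
    for bit in range(N_LAMBDA - 2, -1, -1):
        if (value >> bit) & 1 != sign:
            break                   # first zero-one or one-zero transition
        count += 1
    return min(count, MAX_SHIFT_INDEX)

def pfp_encode(llr_a, llr_b):
    """Return (xi_a, xi_b, sigma) using a shift index shared by both LLRs."""
    sigma = min(shift_index(llr_a), shift_index(llr_b))
    shift = N_LAMBDA - N_XI - sigma
    return llr_a >> shift, llr_b >> shift, sigma   # >> is arithmetic in Python

def pfp_decode(xi_a, xi_b, sigma):
    """Approximate reconstruction: the discarded low-order bits are lost."""
    shift = N_LAMBDA - N_XI - sigma
    return xi_a << shift, xi_b << shift

encoded = pfp_encode(22, -3)
decoded = pfp_decode(*encoded)
payload_bits = 2 * N_XI + 3         # two significands plus the shift index
```

Because the shift index is shared, the LLR with the larger magnitude sets the scale and the smaller one loses more precision, which is consistent with the observation that the most significant bits matter most.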

Table II
Throughput [Mb/s] - area SL [mm^2] - area BL, PFP [mm^2] achieved with the WiMAX N = 1920 interleaver, with generalized Kautz topologies for P ∈ {8, 16, 32, 64}, R ∈ {0.33, 0.5, 1}, SSP-RR, SSP-FL and ASP-FT routing algorithms. No ABR

D = 2
R      Routing (node)   P = 8               P = 16              P = 32              P = 64
1.00   SSP-RR (FA)      104 - 2.15 - 1.46   138 - 3.61 - 2.43   195 - 5.16 - 3.51   264 - 6.97 - 4.85
1.00   SSP-FL (FA)      105 - 2.17 - 1.47   144 - 3.40 - 2.30   208 - 4.57 - 3.13   285 - 6.11 - 4.29
1.00   ASP-FT (AP)      105 - 1.62 - 0.92   144 - 2.54 - 1.43   208 - 3.42 - 1.99   285 - 4.61 - 2.79
0.50   SSP-RR (FA)       86 - 0.47 - 0.38   127 - 1.48 - 1.07   176 - 3.20 - 2.26   231 - 5.33 - 3.80
0.50   SSP-FL (FA)       86 - 0.42 - 0.35   134 - 1.19 - 0.88   187 - 2.74 - 1.96   246 - 4.60 - 3.33
0.50   ASP-FT (AP)       86 - 0.42 - 0.35   134 - 1.00 - 0.69   187 - 2.15 - 1.37   246 - 3.56 - 2.29
0.33   SSP-RR (FA)       58 - 0.39 - 0.33   102 - 0.78 - 0.62   153 - 1.94 - 1.45   199 - 4.05 - 2.98
0.33   SSP-FL (FA)       58 - 0.36 - 0.31   103 - 0.69 - 0.57   161 - 1.55 - 1.20   209 - 3.41 - 2.56
0.33   ASP-FT (AP)       58 - 0.38 - 0.33   103 - 0.67 - 0.54   161 - 1.33 - 0.99   209 - 2.73 - 1.89

D = 3
R      Routing (node)   P = 8               P = 16              P = 32              P = 64
1.00   SSP-RR (FA)      148 - 1.07 - 0.77   204 - 2.19 - 1.53   306 - 3.59 - 2.53   432 - 5.70 - 4.08
1.00   SSP-FL (FA)      165 - 0.69 - 0.53   218 - 2.10 - 1.48   328 - 3.39 - 2.41   448 - 5.39 - 3.89
1.00   ASP-FT (AP)      165 - 0.59 - 0.43   249 - 1.34 - 0.86   344 - 2.57 - 1.60   452 - 4.07 - 2.59
0.50   SSP-RR (FA)       87 - 0.47 - 0.38   152 - 0.90 - 0.71   243 - 1.80 - 1.38   333 - 3.87 - 2.91
0.50   SSP-FL (FA)       87 - 0.45 - 0.37   152 - 0.85 - 0.68   242 - 1.68 - 1.31   334 - 3.45 - 2.64
0.50   ASP-FT (AP)       87 - 0.47 - 0.39   152 - 0.80 - 0.63   244 - 1.43 - 1.07   338 - 2.75 - 1.96
0.33   SSP-RR (FA)       58 - 0.44 - 0.37   103 - 0.84 - 0.67   168 - 1.64 - 1.28   243 - 3.27 - 2.53
0.33   SSP-FL (FA)       58 - 0.42 - 0.35   103 - 0.80 - 0.64   167 - 1.57 - 1.23   243 - 3.10 - 2.41
0.33   ASP-FT (AP)       58 - 0.44 - 0.37   103 - 0.76 - 0.60   168 - 1.38 - 1.05   243 - 2.57 - 1.89

D = 4
R      Routing (node)   P = 8               P = 16              P = 32              P = 64
1.00   SSP-RR (FA)      135 - 1.24 - 0.88   253 - 1.79 - 1.29   323 - 3.41 - 2.45   513 - 5.38 - 3.93
1.00   SSP-FL (FA)      153 - 0.94 - 0.69   279 - 1.41 - 1.05   344 - 3.21 - 2.33   533 - 5.09 - 3.76
1.00   ASP-FT (AP)      166 - 0.63 - 0.46   279 - 1.15 - 0.80   393 - 2.28 - 1.51   533 - 3.99 - 2.66
0.50   SSP-RR (FA)       87 - 0.50 - 0.41   155 - 0.92 - 0.73   247 - 1.98 - 1.53   354 - 3.83 - 2.94
0.50   SSP-FL (FA)       87 - 0.48 - 0.39   155 - 0.90 - 0.72   248 - 1.89 - 1.47   356 - 3.70 - 2.84
0.50   ASP-FT (AP)       87 - 0.52 - 0.43   155 - 0.87 - 0.69   249 - 1.66 - 1.24   356 - 3.12 - 2.27
0.33   SSP-RR (FA)       58 - 0.47 - 0.39   104 - 0.92 - 0.73   169 - 1.85 - 1.45   248 - 3.61 - 2.80
0.33   SSP-FL (FA)       58 - 0.44 - 0.36   104 - 0.88 - 0.70   169 - 1.75 - 1.37   248 - 3.45 - 2.67
0.33   ASP-FT (AP)       58 - 0.48 - 0.40   104 - 0.84 - 0.66   169 - 1.57 - 1.19   248 - 2.97 - 2.19

The techniques described in the previous paragraphs are all aimed at reducing the area of the NoC-based turbo decoder. Furthermore, the ABR technique described in Section IV-A can be used to improve the throughput as well. In order to limit the BER performance loss introduced by the ABR technique,
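we employ the SL reliability criterion proposed in [21], but we send BL, PFP extrinsic information when the criterion is not met. The ABR technique we used is given in Algorithm 1 and can be summarized as follows: denoting by $\vartheta_{i,j}^{ext}$, $\rho_{i,j}^{ext}$ and $\vartheta_{i,j}^{apr}$, $\rho_{i,j}^{apr}$ the first and the second maximum values in $\lambda_{i,j}^{ext}[u]$ and $\lambda_{i,j}^{apr}[u]$, respectively, we compute $\Delta_{i,j}^{ext} = |\vartheta_{i,j}^{ext} - \rho_{i,j}^{ext}|$ and $\Delta_{i,j}^{apr} = |\vartheta_{i,j}^{apr} - \rho_{i,j}^{apr}|$; finally, we compare $\delta_{i,j} = |\Delta_{i,j}^{ext} - \Delta_{i,j}^{apr}|$ with the threshold $K$.

Algorithm 1 SL reliability criterion proposed in [21]
  $\vartheta_{i,j}^{apr} \leftarrow \max\{\lambda_{i,j}^{apr}[u]\}$
  $\rho_{i,j}^{apr} \leftarrow \max\{\lambda_{i,j}^{apr}[u] \setminus \vartheta_{i,j}^{apr}\}$
  $\vartheta_{i,j}^{ext} \leftarrow \max\{\lambda_{i,j}^{ext}[u]\}$
  $\rho_{i,j}^{ext} \leftarrow \max\{\lambda_{i,j}^{ext}[u] \setminus \vartheta_{i,j}^{ext}\}$
  $\Delta_{i,j}^{ext} \leftarrow |\vartheta_{i,j}^{ext} - \rho_{i,j}^{ext}|$
  $\Delta_{i,j}^{apr} \leftarrow |\vartheta_{i,j}^{apr} - \rho_{i,j}^{apr}|$
  $\delta_{i,j} \leftarrow |\Delta_{i,j}^{ext} - \Delta_{i,j}^{apr}|$
  if $\delta_{i,j} < K$ then
    do not send any packet
  else
    send $\xi_{i,j}[A]$, $\xi_{i,j}[B]$, $\sigma_{i,j}$
  end if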
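
The symbol-level reliability test can be sketched in Python as follows. It is a minimal illustration that assumes each of the extrinsic and a-priori arrays holds the three symbol-level LLRs of one double-binary symbol and that the first and second maxima are taken over those three values only; the function names and the sample values are hypothetical.

```python
def top_two_gap(symbol_llrs):
    """Absolute difference between the largest and second largest LLR."""
    first, second = sorted(symbol_llrs, reverse=True)[:2]
    return abs(first - second)

def sl_reliability_send(ext_llrs, apr_llrs, threshold):
    """Return True when the packet must still be sent over the network."""
    delta = abs(top_two_gap(ext_llrs) - top_two_gap(apr_llrs))
    return delta >= threshold

# Hypothetical LLR triples with K = 4.
send_small_change = sl_reliability_send([10, 2, -1], [9, 3, 0], threshold=4)
send_large_change = sl_reliability_send([20, 2, -1], [9, 3, 0], threshold=4)
```

When the function returns True, the values actually transmitted would be the bit-level, PFP-compressed quantities rather than the symbol-level array, so the test and the payload use two different representations of the same extrinsic information.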

As shown in Fig. 8, the BER performance loss introduced by the ABR technique is negligible. Moreover, as shown in Fig. 9, Fig. 10 and Fig. 11, when $R = 1$ the ABR technique induces an average throughput increase of about 5 to 20 Mb/s. Similarly to the binary codes, in the best case the throughput improvement is more than 40 Mb/s, whereas when $R < 1$ the average throughput improvement is at most 5 Mb/s.



V. Conclusions 

In this work ABR techniques have been exploited to improve the throughput of NoC-based turbo decoder architectures. When the load of the network is high the average throughput is improved by about 5 to 20 Mb/s, and in the best case the throughput is increased by more than 60 Mb/s and 40 Mb/s for binary and double-binary codes, respectively. Moreover, the area required to support double-binary codes has been significantly reduced (by more than 40% in the best case) by applying the BL, PFP representation of the extrinsic information, with a BER performance loss of about 0.2 dB.